{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Clustering\n",
    "\n",
    "Clustering is the assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense. Clustering is a method of unsupervised learning, and a common technique for statistical data analysis used in many fields.\n",
    "\n",
    "Hierarchical algorithms find successive clusters using previously established clusters. These algorithms usually are either agglomerative (\"bottom-up\") or divisive (\"top-down\"). Agglomerative algorithms begin with each element as a separate cluster and merge them into successively larger clusters. Divisive algorithms begin with the whole set and proceed to divide it into successively smaller clusters.\n",
    "\n",
    "Partitional algorithms typically determine all clusters at once, but can also be used as divisive algorithms in the hierarchical clustering. Many partitional clustering algorithms require the specification of the number of clusters to produce in the input data set, prior to execution of the algorithm. Barring knowledge of the proper value beforehand, the appropriate value must be determined, a problem on its own for which a number of techniques have been developed.\n",
    "\n",
    "Density-based clustering algorithms are devised to discover arbitrary-shaped clusters. In this approach, a cluster is regarded as a region in which the density of data objects exceeds a threshold.\n",
    "\n",
    "Subspace clustering methods look for clusters that can only be seen in a particular projection (subspace, manifold) of the data. These methods thus can ignore irrelevant attributes. The general problem is also known as Correlation clustering while the special case of axis-parallel subspaces is also known as two-way clustering, co-clustering or biclustering in bioinformatics: in these methods not only the objects are clustered but also the features of the objects, i.e., if the data is represented in a data matrix, the rows and columns are clustered simultaneously. They usually do not however work with arbitrary feature combinations as in general subspace methods."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "import $ivy.`com.github.haifengl::smile-scala:4.0.0`\n",
    "import $ivy.`org.slf4j:slf4j-simple:2.0.16`  \n",
    "\n",
    "import scala.language.postfixOps\n",
    "import smile._\n",
    "import smile.math._\n",
    "import smile.math.distance._\n",
    "import smile.math.kernel._\n",
    "import smile.math.matrix._\n",
    "import smile.math.matrix.Matrix._\n",
    "import smile.math.rbf._\n",
    "import smile.stat.distribution._\n",
    "import smile.data._\n",
    "import smile.data.formula._\n",
    "import smile.data.measure._\n",
    "import smile.data.`type`._\n",
    "import smile.clustering._\n",
    "\n",
    "import java.awt.Color.{BLACK, BLUE, CYAN, DARK_GRAY, GRAY, GREEN, LIGHT_GRAY, MAGENTA, ORANGE, PINK, RED, WHITE, YELLOW}\n",
    "import smile.plot.swing.Palette.{DARK_RED, VIOLET_RED, DARK_GREEN, LIGHT_GREEN, PASTEL_GREEN, FOREST_GREEN, GRASS_GREEN, NAVY_BLUE, SLATE_BLUE, ROYAL_BLUE, CADET_BLUE, MIDNIGHT_BLUE, SKY_BLUE, STEEL_BLUE, DARK_BLUE, DARK_MAGENTA, DARK_CYAN, PURPLE, LIGHT_PURPLE, DARK_PURPLE, GOLD, BROWN, SALMON, TURQUOISE, BURGUNDY, PLUM}\n",
    "import smile.plot.swing._\n",
    "import smile.plot.show\n",
    "import smile.plot.Render._"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Hierarchical Clustering\n",
    "\n",
    "Agglomerative hierarchical clustering seeks to build a hierarchy of clusters in a bottom up approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy. The results of hierarchical clustering are usually presented in a dendrogram.\n",
    "\n",
    "In general, the merges are determined in a greedy manner. In order to decide which clusters should be combined, a measure of dissimilarity between sets of observations is required. In most methods of hierarchical clustering, this is achieved by use of an appropriate metric, and a linkage criteria which specifies the dissimilarity of sets as a function of the pairwise distances of observations in the sets.\n",
    "\n",
    "Hierarchical clustering has the distinct advantage that any valid measure of distance can be used. In fact, the observations themselves are not required: all that is used is a matrix of distances."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "val x = read.csv(\"../data/clustering/rem.txt\", header = false, delimiter = \" \").toArray()\n",
    "val clusters = hclust(x, \"complete\")\n",
    "show(dendrogram(clusters))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## K-Means\n",
    "\n",
    "K-Means clustering partitions n observations into k clusters in which each observation belongs to the cluster with the nearest mean. Although finding an exact solution to the K-Means problem for arbitrary input is NP-hard, the standard approach to finding an approximate solution (often called Lloyd's algorithm or the K-Means algorithm) is used widely and frequently finds reasonable solutions quickly.\n",
    "\n",
    "K-Means is a hard clustering method, i.e. each sample is assigned to a specific cluster. In contrast, soft clustering, e.g. the Expectation-Maximization algorithm for Gaussian mixtures, assign samples to different clusters with different probabilities.\n",
    "\n",
    "The K-Means algorithm has at least two major theoretic shortcomings:\n",
    "\n",
    "First, it has been shown that the worst case running time of the algorithm is super-polynomial in the input size.\n",
    "Second, the approximation found can be arbitrarily bad with respect to the objective function compared to the optimal learn.\n",
    "In Smile, we use K-Means++ which addresses the second of these obstacles by specifying a procedure to initialize the cluster centers before proceeding with the standard K-Means optimization iterations. With the K-Means++ initialization, the algorithm is guaranteed to find a solution that is `O(log k)` competitive to the optimal K-Means solution.\n",
    "\n",
    "We also use K-D trees to speed up each K-Means step as described in the filter algorithm by Kanungo, et al."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "val clusters = kmeans(x, 6, runs = 20)\n",
    "show(plot(x, clusters.y, '.'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "K-Means works very well on Gaussian mixtures. If the clusters are elongated, however, the results may be far from optimal."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "val elongate = read.csv(\"../data/clustering/elongate.txt\", header = false, delimiter = \"\\t\").toArray()\n",
    "val clusters = kmeans(elongate, 2, runs = 20)\n",
    "show(plot(elongate, clusters.y, '.'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In K-Means, the number of clusters K has to be supplied by the user. However, the appropriate number of clusters is often unknown in practice. Several approaches (e.g. X-Means, G-Means, deterministic annealing, etc.) have been proposed to handle this challenge."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## X-Means\n",
    "\n",
    "X-Means clustering algorithm is an extended K-Means which tries to automatically determine the number of clusters based on BIC scores. Starting with only one cluster, the X-Means algorithm goes into action after each run of K-Means, making local decisions about which subset of the current centroids should split themselves in order to better fit the data. The splitting decision is done by computing the Bayesian Information Criterion (BIC)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "val clusters = xmeans(x, 50)\n",
    "show(plot(x, clusters.y, '.'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## G-Means\n",
    "\n",
    "G-Means clustering algorithm is another extended K-Means which tries to automatically determine the number of clusters by normality test. The G-Means algorithm is based on a statistical test for the hypothesis that a subset of data follows a Gaussian distribution. G-Means runs K-Means with increasing k in a hierarchical fashion until the test accepts the hypothesis that the data assigned to each K-Means center are Gaussian."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "val clusters = gmeans(x, 50)\n",
    "show(plot(x, clusters.y, '.'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Neither X-Means nor G-Means works well on the elongate data. Both report only one cluster."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Deterministic Annealing Clustering\n",
    "\n",
    "The observation of annealing processes in physical chemistry motivated the use of similar concepts to avoid local minima of the optimization cost. Certain chemical systems can be driven to their low-energy states by annealing, which is a gradual reduction of temperature, spending a long time at the vicinity of the phase transition points. In the corresponding probabilistic framework, a Gibbs distribution is defined over the set of all possible configurations which assigns higher probability to configurations of lower energy. This distribution is parameterized by the temperature, and as the temperature is lowered it becomes more discriminating (concentrating most of the probability in a smaller subset of low-energy configurations). At the limit of low temperature it assigns nonzero probability only to global minimum configurations.\n",
    "\n",
    "A known technique for nonconvex optimization that capitalizes on this physical analogy is simulated annealing based on the Metropolis algorithm. A sequence of random moves is generated and the random decision to accept a move depends on the cost of the resulting configuration relative to that of the current state. However, one must be very careful with the annealing schedule, i.e., the rate at which the temperature is lowered. In theory, the global minimum can be achieved if the schedule obeys `T ∝ 1 / log n`, where `n` is the number of the current iteration. Such schedules are not realistic in many applications. It was shown that perturbations of infinite variance (e.g., the Cauchy distribution) provide better ability to escape from minima and allow, in principle, the use of faster schedules.\n",
    "\n",
    "Deterministic annealing tries to enjoy the best of both worlds. On the one hand it is deterministic, meaning that we do not want to be wandering randomly on the energy surface while making incremental progress on the average, as is the case for simulated annealing. On the other hand, it is still an annealing method and aims at the global minimum, instead of getting greedily attracted to a nearby local minimum. One can view deterministic annealing as replacing stochastic simulations by the use of expectation. An effective energy function, which is parameterized by a (pseudo) temperature, is derived through expectation and is deterministically optimized at successively reduced temperatures.\n",
    "\n",
    "Deterministic annealing clustering is based on principles of information theory and probability theory, and it consists of minimizing the clustering cost at prescribed levels of randomness. The method provides soft clustering solutions at different scales, where the scale is directly related to the temperature parameter. For each temperature value, the algorithm iterates between the calculation of all posteriori probabilities and the update of the centroids vectors, until convergence is reached. There are \"phase transitions\" in the design process, where phases correspond to the number of effective clusters in the solution, which grows via splits as the temperature is lowered. The annealing starts with a high temperature. Here, all centroids vectors converge to the center of the pattern distribution (independent of their initial positions). Below a critical temperature the vectors start to split. Further decreasing the temperature leads to more splittings until all centroids vectors are separate. The annealing can therefore avoid (if it is sufficiently slow) the convergence to local minima. If a limitation on the number of clusters is imposed, then at zero temperature a hard clustering solution is obtained."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "val clusters = dac(x, 12, 0.9)\n",
    "show(plot(x, clusters.y, '.'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Sequential Information Bottleneck\n",
    "\n",
    "The Sequential Information Bottleneck (SIB) algorithm clusters co-occurrence data such as text documents vs words. SIB is guaranteed to converge to a local maximum of the information. Moreover, the time and space complexity are significantly improved in contrast to the agglomerative IB algorithm.\n",
    "\n",
    "In analogy to K-Means, SIB's update formulas are essentially same as the EM algorithm for estimating finite Gaussian mixture model by replacing regular Euclidean distance with Kullback-Leibler divergence, which is clearly a better dissimilarity measure for co-occurrence data. However, the common batch updating rule (assigning all instances to nearest centroids and then updating centroids) of K-Means won't work in SIB, which has to work in a sequential way (reassigning (if better) each instance then immediately update related centroids). It might be because K-L divergence is very sensitive and the centroids may be significantly changed in each iteration in batch updating rule.\n",
    "\n",
    "Note that this implementation has a little difference from the original paper, in which a weighted Jensen-Shannon divergence is employed as a criterion to assign a randomly-picked sample to a different cluster. However, this doesn't work well in some cases as we experienced probably because the weighted JS divergence gives too much weight to clusters which is much larger than a single sample. In this implementation, we instead use the regular/unweighted Jensen-Shannon divergence."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "val data = read.libsvm(\"../data/libsvm/news20.dat\")\n",
    "val sparse = (0 until data.size).map(i => data(i).x).toArray\n",
    "val clusters = sib(sparse, 20, 100, 8)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The news data in data/libsvm/news20.dat is a very sparse data of dimension 62061. The data contains 15935 samples. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## CLARANS\n",
    "\n",
    "The K-Medoids algorithm is an adaptation of the k-means algorithm. Rather than calculate the mean of the items in each cluster, a representative item, or medoid, is chosen for each cluster at each iteration. The K-Medoids algorithm attempts to minimize the distance between points labeled to be in a cluster and the medoid of that cluster. So a medoid is a most centrally located point in the cluster. K-Medoids works with an arbitrary matrix of distances between data points instead of L2. It is also more robust to noise and outliers as compared to K-Means.\n",
    "\n",
    "The most common realisation of K-Medoids clustering is the Partitioning Around Medoids (PAM) algorithm. PAM uses a greedy search which may not find the optimum solution, but it is faster than exhaustive search.\n",
    "\n",
    "CLARANS (Clustering Large Applications based upon RANdomized Search) is a more efficient medoid-based clustering algorithm. In CLARANS, the process of finding k medoids from n objects is viewed abstractly as searching through a certain graph. In the graph, a node is represented by a set of k objects as selected medoids. Two nodes are neighbors if their sets differ by only one object. In each iteration, CLARANS considers a set of randomly chosen neighbor nodes as candidate of new medoids. We will move to the neighbor node if the neighbor is a better choice for medoids. Otherwise, a local optima is discovered. The entire process is repeated multiple time to find better.\n",
    "```\n",
    "def clarans[T](data: Array[T], distance: Distance[T], k: Int, maxNeighbor: Int, numLocal: Int): CLARANS[T]\n",
    "```\n",
    "The parameter `maxNeighbor` specifies the maximum number of neighbors examined. The higher the value of `maxNeighbor`, the closer is CLARANS to PAM, and the longer is each search of a local minima. But the quality of such a local minima is higher and fewer local minima needs to be obtained."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "val clusters = clarans(x, new EuclideanDistance(), 6, 10, 20)\n",
    "show(plot(x, clusters.y, '.'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## DBSCAN\n",
    "\n",
    "DBSCAN (Density-Based Spatial Clustering of Applications with Noise) finds a number of clusters starting from the estimated density distribution of corresponding nodes.\n",
    "```\n",
    "// DBSCAN with a customized data structure for neighborhood search\n",
    "def dbscan[T](data: Array[T], nns: RNNSearch[T, T], minPts: Int, radius: Double): DBSCAN[T]\n",
    "\n",
    "def dbscan[T](data: Array[T], distance: Metric[T], minPts: Int, radius: Double): DBSCAN[T]\n",
    "\n",
    "// DBSCAN with Euclidean distance\n",
    "def dbscan(data: Array[Array[Double]], minPts: Int, radius: Double): DBSCAN[Array[Double]]\n",
    "``` \n",
    "DBSCAN requires two parameters: radius (i.e. neighborhood radius) and the number of minimum points required to form a cluster (minPts). It starts with an arbitrary starting point that has not been visited. This point's neighborhood is retrieved, and if it contains sufficient number of points, a cluster is started. Otherwise, the point is labeled as noise. Note that this point might later be found in a sufficiently sized radius-environment of a different point and hence be made part of a cluster.\n",
    "\n",
    "If a point is found to be part of a cluster, its neighborhood is also part of that cluster. Hence, all points that are found within the neighborhood are added, as is their own neighborhood. This process continues until the cluster is completely found. Then, a new unvisited point is retrieved and processed, leading to the discovery of a further cluster of noise.\n",
    "\n",
    "DBSCAN visits each point of the database, possibly multiple times (e.g., as candidates to different clusters). For practical considerations, however, the time complexity is mostly governed by the number of nearest neighbor queries. DBSCAN executes exactly one such query for each point, and if an indexing structure is used that executes such a neighborhood query in `O(log n)`, an overall runtime complexity of `O(n log n)` is obtained."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "val d4 = read.csv(\"../data/clustering/d4.txt\", header = false, delimiter = \"\\t\").toArray()\n",
    "val clusters = dbscan(d4, 19, 0.1)\n",
    "show(plot(d4, clusters.y, '.'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With appropriate parameters, DBSCAN can discover the correct clusters and also identify outliers."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## DENCLUE\n",
    "\n",
    "DENCLUE (DENsity CLUstering) employs a cluster model based on kernel density estimation. A cluster is defined by a local maximum of the estimated density function. Data points going to the same local maximum are put into the same cluster. DENCLUE works efficiently for high-dimensional data sets and allows arbitrary noise levels while still guaranteeing to find the clustering.\n",
    "```\n",
    "def denclue(data: Array[Array[Double]], sigma: Double, m: Int): DENCLUE\n",
    "``` \n",
    "The parameter sigma is the smooth parameter in the Gaussian kernel. The user can choose sigma such that number of density attractors is constant for a long interval of sigma. The parameter m is the number of selected samples used in the iteration. This number should be much smaller than the number of data points to speed up the algorithm. It should also be large enough to capture the sufficient information of underlying distribution.\n",
    "\n",
    "Clearly, DENCLUE doesn't work on data with uniform distribution. In high dimensional space, the data always look like uniformly distributed because of the curse of dimensionality. Therefore, DENCLUDE doesn't work well on high-dimensional data in general."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "val clusters = denclue(x, 1.0, 50)\n",
    "show(plot(x, clusters.y, '.'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Spectral Clustering\n",
    "\n",
    "Given a set of data points, the similarity matrix may be defined as a matrix S where Sij represents a measure of the similarity between points. Spectral clustering techniques make use of the spectrum of the similarity matrix of the data to perform dimensionality reduction for clustering in fewer dimensions. Then the clustering will be performed in the dimension-reduce space, in which clusters of non-convex shape may become tight. There are some intriguing similarities between spectral clustering methods and kernel PCA, which has been empirically observed to perform clustering.\n",
    "```\n",
    "def specc(W: Array[Array[Double]], k: Int): SpectralClustering\n",
    "\n",
    "def specc(data: Array[Array[Double]], k: Int, sigma: Double): SpectralClustering\n",
    "\n",
    "// Nystrom approximation\n",
    "def specc(data: Array[Array[Double]], k: Int, l: Int, sigma: Double): SpectralClustering\n",
    "```    \n",
    "where W is the adjacency matrix of graph. The user may also provides the raw input data and the smooth/width parameter sigma of Gaussian kernel, which is a somewhat sensitive parameter. To search for the best setting, one may pick the value that gives the tightest clusters (smallest distortion, reported by the method distortion) in feature space. Spectral clustering is memory intensive because of the adjacency matrix. For large data, one may use Nystrom approximation by selecting some random samples. The parameter l specifies the number of random samples."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "val sincos = read.csv(\"../data/clustering/sincos.txt\", header = false, delimiter = \"\\t\").toArray()\n",
    "val clusters = specc(sincos, 2, 0.2)\n",
    "show(plot(sincos, clusters.y, 'o'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Minimum Entropy Clustering\n",
    "\n",
    "In this algorithm, the clustering criterion is based on the conditional entropy `H(C | x)`, where C is the cluster label and x is an observation. According to Fano's inequality, we can estimate C with a low probability of error only if the conditional entropy `H(C | x)` is small. Minimum Entropy Clustering (MEC) also generalizes the criterion by replacing Shannon's entropy with Havrda-Charvat's structural α-entropy. Interestingly, the minimum entropy criterion based on structural `α`-entropy is equal to the probability error of the nearest neighbor method when `α = 2`. To estimate `p(C | x)`, MEC employs Parzen density estimation, a nonparametric approach.\n",
    "\n",
    "This method performs very well especially when the exact number of clusters is unknown. The method can also correctly reveal the structure of data and effectively identify outliers simultaneously.\n",
    "```\n",
    "def mec[T](data: Array[T], distance: Distance[T], k: Int, radius: Double): MEC[T]\n",
    "def mec[T](data: Array[T], distance: Metric[T], k: Int, radius: Double): MEC[T]\n",
    "def mec[T](data: Array[T], nns: RNNSearch[T, T], k: Int, radius: Double, y: Array[Int]): MEC[T]\n",
    "def mec(data: Array[Array[Double]], k: Int, radius: Double): MEC[Array[Double]]\n",
    "``` \n",
    "MEC is an iterative algorithm starting with an initial partition given by any other clustering methods, e.g. K-Means, CLARNAS, hierarchical clustering, etc. Note that a random initialization is NOT appropriate."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "val clusters = mec(x, 20, 2.0)\n",
    "show(plot(x, clusters.y, '.'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that we use `k = 20` for this data and the algorithm still successfully finds the correct structure of data. In practice, we rarely know the right number of clusters in advance. With MEC, one may starts with a large k and the algorithm often can automatically remove unnecessary clusters and reach a lower entropy state."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Scala (2.13)",
   "language": "scala",
   "name": "scala213"
  },
  "language_info": {
   "codemirror_mode": "text/x-scala",
   "file_extension": ".sc",
   "mimetype": "text/x-scala",
   "name": "scala",
   "nbconvert_exporter": "script",
   "version": "2.13.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
